library(lubridate)
library(knitr)
library(viridis)
library(lme4)
library(lmerTest)
library(kableExtra)
library(MLmetrics)
library(caret)
library(tseries)
library(forecast)
#library(tidymodels)
suppressPackageStartupMessages(library(tidyverse))
set.seed(1)
df <- read_tsv("../data/Question2.tsv")
print(dim(df))
## [1] 981 3
kbl(head(df), 'html', align = 'r') %>% kable_styling()
| Date | Cost | Revenue |
|---|---|---|
| 2019-08-20 | $4,214.69 | $14,349.50 |
| 2019-08-21 | $2,627.13 | $13,463.13 |
| 2019-08-22 | $3,196.08 | $10,297.64 |
| 2019-08-23 | $1,134.45 | $6,296.62 |
| 2019-08-24 | $3,207.85 | $10,291.51 |
| 2019-08-25 | $7,325.59 | $24,733.41 |
NA and duplicates

df %>%
  summarise(across(everything(), list(zeros = ~sum(is.na(.)))),
            duplicated.days = sum(duplicated(Date))) %>%
  kbl('html') %>% kable_styling()
| Date_zeros | Cost_zeros | Revenue_zeros | duplicated.days |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
There is one duplicated day, and there are no NAs.
df %>%
filter(Date == df$Date[duplicated(df$Date)]) %>%
kbl('html') %>% kable_styling()
| Date | Cost | Revenue |
|---|---|---|
| 2021-09-03 | $5,432.35 | $20,120.06 |
| 2021-09-03 | $5,432.35 | $20,120.06 |
The above shows that the corresponding Cost and Revenue values are identical for the duplicated date, so we can simply delete one instance:
df = df %>% filter(!duplicated(Date))
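As a quick illustration of the step above, `filter(!duplicated(x))` keeps only the first occurrence of each value; a minimal base-R sketch with a hypothetical vector:

```r
# duplicated() flags the second and later occurrences of each value,
# so negating it keeps only the first instance
x <- c("a", "b", "b", "c")
x[!duplicated(x)]  # "a" "b" "c"
```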
- Convert Cost and Revenue, which read in as character/string data, to numeric values
- Extract day, month, and year, as well as "year day" and "month day", from the Date column to aid in visualization and possibly modeling
- Create a Profit variable, estimated as \(Revenue - Cost\)

df1 = df %>%
mutate(
# convert the cost and revenue columns from character strings of the format
# $ [n-digit dollars] . [two-digit cents]
across(c(Cost, Revenue), ~{
dollars = as.numeric(gsub('$|[[:punct:]]', '', substr(., 1, nchar(.) - 2)))
cents = substr(., nchar(.) - 1, nchar(.))
as.numeric(paste(dollars, cents, sep = '.'))
}),
# From the date column, it may be handy to have the day, the month, year,
# the day within year (from 0 - 365 excluding leap years), day w/in month
day = day(Date), month = month(Date), year = factor(year(Date)),
yday = yday(Date), mday=mday(Date), # day w/in year and month
# For visualization it can be helpful to have just the month-day (no year)
month_and_day = as.factor(sub('.{5}', '', Date)),
# Code the variable profit, estimated by Revenue - cost
#Profit = Revenue - Cost, efficiency = Revenue / Cost)
Profit = Revenue - Cost)
head(df1) %>%
kbl('html') %>% kable_styling()
| Date | Cost | Revenue | day | month | year | yday | mday | month_and_day | Profit |
|---|---|---|---|---|---|---|---|---|---|
| 2019-08-20 | 4214.69 | 14349.50 | 20 | 8 | 2019 | 232 | 20 | 08-20 | 10134.81 |
| 2019-08-21 | 2627.13 | 13463.13 | 21 | 8 | 2019 | 233 | 21 | 08-21 | 10836.00 |
| 2019-08-22 | 3196.08 | 10297.64 | 22 | 8 | 2019 | 234 | 22 | 08-22 | 7101.56 |
| 2019-08-23 | 1134.45 | 6296.62 | 23 | 8 | 2019 | 235 | 23 | 08-23 | 5162.17 |
| 2019-08-24 | 3207.85 | 10291.51 | 24 | 8 | 2019 | 236 | 24 | 08-24 | 7083.66 |
| 2019-08-25 | 7325.59 | 24733.41 | 25 | 8 | 2019 | 237 | 25 | 08-25 | 17407.82 |
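As a sanity check on the conversion, the parsing steps inside the `across()` call can be run on a single literal string (the value is taken from the first row above):

```r
# Split "$4,214.69" into dollars and cents the same way the across() block does:
# drop the last two characters, strip punctuation for the dollars, then rejoin
x <- "$4,214.69"
dollars <- as.numeric(gsub("$|[[:punct:]]", "", substr(x, 1, nchar(x) - 2)))
cents   <- substr(x, nchar(x) - 1, nchar(x))
as.numeric(paste(dollars, cents, sep = "."))  # 4214.69
```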
Revenue

ggplot(df1) +
geom_density(aes(x = Revenue, fill = year), color='black', alpha = .5, linewidth=.5) +
theme_minimal() +
scale_fill_viridis_d() +
facet_grid(rows = vars(year))
This variable shows a high degree of positive skew that is pretty consistent across years. Let’s view it on a log scale:
ggplot(df1) +
geom_density(aes(x = Revenue, fill = year), color='black', alpha = .5, linewidth=.5) +
theme_minimal() +
scale_fill_viridis_d() +
facet_grid(rows = vars(year)) +
scale_x_log10()
Transforming the axis (and thereby implicitly visualizing log(Revenue)) reveals a trend across years and also normalizes the data fairly well. The overall distribution of the variable after log-transformation:
df1 %>%
ggplot +
geom_density(aes(x = log(Revenue)),
fill='skyblue', color='black', alpha = .5, linewidth=.5) +
theme_minimal()
# Format data for plotting
plt.dat = df1 %>%
mutate(`log(Revenue)` = log(Revenue), `log(Profit)` = log(Profit))
# plot of: Revenue (log transformed)
plt.dat %>%
ggplot +
geom_line(aes(x = yday, y = `log(Revenue)`, color = year, group = year)) +
# Label just the first day of each month:
scale_x_continuous(
name = 'Month',
breaks = as.numeric(plt.dat$yday)[(plt.dat$mday == 1)&(plt.dat$year==2021)],
labels = month(1:12, label = T)
) +
# set relevant y-axis
scale_y_continuous(limits = c(8,12.25)) +
theme_minimal()
# plot of: Profit = Revenue - Cost (log transformed)
plt.dat %>%
ggplot +
geom_line(aes(x = yday, y = `log(Profit)`, color = year, group = year)) +
# Label just the first day of each month:
scale_x_continuous(
name = 'Month',
breaks = as.numeric(plt.dat$yday)[(plt.dat$mday == 1)&(plt.dat$year==2021)],
labels = month(1:12, label = T)
) +
# set relevant y-axis
scale_y_continuous(limits = c(8,12.25)) +
theme_minimal()
Among the years for which we have complete data, 2021 shows the most revenue, and there appears to be a year-over-year trend such that 2019 < 2020 < 2021 < 2022.
Revenue (in log units) increases approximately monotonically from January through the holiday season, with each month higher on average than the previous; however, a sharp decline beginning mid-December breaks this pattern.
2020 appears to be an outlier relative to the other years. It shows much more variability in its slow-frequency trend, with surprisingly high revenue from January to March, followed by a large drop and then a big rebound. The dip and rebound may reflect the start of the COVID-19 pandemic, or may simply be noise.
Cost and Revenue

ggplot(df1, aes(x = log(Cost), y = log(Revenue))) +
geom_point() +
geom_smooth(method = 'lm', formula = 'y~x') +
facet_wrap(~year,scales = 'fixed') +
theme_minimal()
This graph shows that despite the long positive skew in the distributions of Cost and Revenue, the relationship between them is reliable even at extreme values of both. There appears to be a strong correlation between Cost and Revenue, and the relationship between log(Cost) and log(Revenue) appears to be linear. As such, it appears appropriate to model log(Revenue) going forward. Finally, this trend appears consistent across years, although, based on admittedly fewer data points, Cost appears to have been less effective at generating Revenue in 2022.
df1 %>%
pivot_longer(cols = c('Revenue', 'Cost'), names_to = 'name', values_to = 'value') %>%
ggplot +
geom_line(aes(yday, y = log(value), color = name, group = name)) +
scale_x_continuous(
name = 'Month',
breaks = as.numeric(df1$yday)[(df1$mday == 1)&(df1$year==2021)],
labels = month(1:12, label = T),
) +
scale_y_continuous(limits = c(6.5,12.25)) +
facet_grid(rows = vars(year)) +
theme_bw()
(ct = cor.test(df1$Cost, df1$Revenue, method = 'spearman'))
##
## Spearman's rank correlation rho
##
## data: df1$Cost and df1$Revenue
## S = 24072006, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8465433
The dynamic coupling between Cost and Revenue is significant, \(\rho_{spearman}\) = 0.85, p < .0001. Furthermore, Cost seems to predict both short-term fluctuations and long-term trends in Revenue, such that separate seasonal terms may be redundant.
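As a sanity check, with n = 980 observations and assuming no ties, the printed S statistic and rho are linked by the classical identity \(\rho = 1 - 6S / (n(n^2 - 1))\):

```r
# Recover rho from the S statistic reported by cor.test() above
n <- 980        # days remaining after removing the duplicate
S <- 24072006   # Spearman S statistic from the output
1 - 6 * S / (n * (n^2 - 1))  # ~0.8465, matching the reported rho
```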
df1 %>%
ggplot +
geom_line(aes(x = Date, y = log(Revenue))) +
theme_bw()
The plot suggests a yearly seasonal trend that resets in mid-December each year.
adf.test(df1$Revenue)
## Warning in adf.test(df1$Revenue): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: df1$Revenue
## Dickey-Fuller = -4.1153, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
This suggests the time series is stationary and can be appropriately modeled using an ARIMA-like model (ARIMA or ARIMAX).
Conclusions from EDA and graphical analysis
df2 = df1 %>%
# Log transform Revenue and Cost
mutate(log.Revenue = log(Revenue),
log.Cost = log(Cost))
# Split December into two parts, before and after the peak (for that year)
for (y in unique(df2$year)) {
dec.data = df2 %>% filter(month == 12, year == y)
if (nrow(dec.data) > 0) { # if this year has december data
dec.data[-1:-which.max(dec.data$Cost), 'month'] = 13
df2[(df2$year == y) & (df2$month == 12), 'month'] = dec.data$month
}
}
# Store month as an ordered factor
df2 = df2 %>% mutate(across(month, ordered))
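The loop's splitting rule can be illustrated on a hypothetical run of December costs: days up to and including the peak-Cost day keep month 12, while later days are recoded to 13:

```r
# Hypothetical December costs, peaking on day 3
dec_cost <- c(100, 250, 900, 400, 150)
month <- rep(12, length(dec_cost))
month[-1:-which.max(dec_cost)] <- 13  # same negative indexing as the loop above
month  # 12 12 12 13 13
```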
df3 = df2 %>%
  # Keep both the raw and log-transformed versions of Revenue and Cost
  select(log.Revenue, log.Cost, Revenue, Cost) %>%
  ts # convert to a time-series object using ts()
head(df3)
## Time Series:
## Start = 1
## End = 6
## Frequency = 1
## log.Revenue log.Cost Revenue Cost
## 1 9.571470 8.346331 14349.50 4214.69
## 2 9.507710 7.873647 13463.13 2627.13
## 3 9.239670 8.069680 10297.64 3196.08
## 4 8.747768 7.033903 6296.62 1134.45
## 5 9.239075 8.073356 10291.51 3207.85
## 6 10.115910 8.899129 24733.41 7325.59
Note: the first model I fit specified stationary = TRUE, implicitly fitting an ARMA model, but its residuals were auto-correlated. The differenced ARIMA model better satisfied the assumption that residuals are independent. Finally, evidence for independence of residuals was strongest when using raw Revenue and Cost rather than their log transformations.
mod <- auto.arima(df3[, "Revenue"], xreg = df3[, "Cost"], stationary = F)
# Check model fit
mod
## Series: df3[, "Revenue"]
## Regression with ARIMA(1,1,2) errors
##
## Coefficients:
## ar1 ma1 ma2 xreg
## -0.6280 0.0417 -0.4760 3.7673
## s.e. 0.1358 0.1291 0.0758 0.0744
##
## sigma^2 = 32821109: log likelihood = -9858.98
## AIC=19727.96 AICc=19728.02 BIC=19752.39
The best-fitting model suggests autocorrelation and non-stationarity, as well as significant explanatory/predictive value of Cost.
plt.dat %>%
mutate(predicted = mod$fitted) %>%
pivot_longer(cols = c('Revenue', 'predicted'), names_to = 'name', values_to = 'value') %>%
ggplot +
geom_line(aes(yday, y = log(value), color = name, group = name)) +
scale_x_continuous(
name = 'Month',
breaks = as.numeric(df1$yday)[(df1$mday == 1)&(df1$year==2021)],
labels = month(1:12, label = T),
) +
scale_y_continuous(limits = c(6.5,12.25)) +
facet_grid(rows = vars(year)) +
theme_bw()
The model appears to fit very well with just Cost as a predictor.
checkresiduals(mod)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(1,1,2) errors
## Q* = 11.67, df = 7, p-value = 0.1119
##
## Model df: 3. Total lags used: 10
On the first day of the week we spend $17,007.92; how would you allocate the remaining budget throughout the week to try to maximize the amount of Revenue that we get for that channel?
budget = 116000 - 17007.92
new_data = tibble(
uniform = rep(1/6, times = 6),
increasing = c(.05, .05, .1, .15, .3, .35),
decreasing = rev(increasing),
max1 = ifelse(1:6 == 1, .80, (1 - .80)/5),
max2 = ifelse(1:6 == 2, .80, (1 - .80)/5), # one day takes majority of expenditure
max3 = ifelse(1:6 == 3, .80, (1 - .80)/5),
max4 = ifelse(1:6 == 4, .80, (1 - .80)/5),
max5 = ifelse(1:6 == 5, .80, (1 - .80)/5),
max6 = ifelse(1:6 == 6, .80, (1 - .80)/5)
)
all(sapply(new_data, sum) == 1) # each allocation should sum to 1
## [1] TRUE
# Different expenditure patterns:
(new_data = new_data %>% mutate(across(everything(), ~.*budget)))
## # A tibble: 6 × 9
## uniform increasing decreasing max1 max2 max3 max4 max5 max6
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 16499. 4950. 34647. 79194. 3960. 3960. 3960. 3960. 3960.
## 2 16499. 4950. 29698. 3960. 79194. 3960. 3960. 3960. 3960.
## 3 16499. 9899. 14849. 3960. 3960. 79194. 3960. 3960. 3960.
## 4 16499. 14849. 9899. 3960. 3960. 3960. 79194. 3960. 3960.
## 5 16499. 29698. 4950. 3960. 3960. 3960. 3960. 79194. 3960.
## 6 16499. 34647. 4950. 3960. 3960. 3960. 3960. 3960. 79194.
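The rounded values in the table above can be checked by hand: the uniform split divides the remaining budget six ways, and each "max" scenario gives 80% to one day and splits the remaining 20% across the other five:

```r
budget <- 116000 - 17007.92  # remaining budget after the first day
round(budget / 6)            # uniform daily spend: 16499
round(0.80 * budget)         # the concentrated day in a "max" scenario: 79194
round(0.20 * budget / 5)     # each of the other five days: 3960
```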
Plots show the forecasted predictions, and provide the cumulative revenue for the forecasted week.
predictions = map(new_data, ~{forecast(mod, xreg = .x)})
map(1:length(predictions), ~{
p = predictions[[.x]]
cond = names(new_data)[.x]
revenue = round(sum(as.vector(p$mean)), 2)
plt.t = paste0('Total Forecasted Revenue from 4/26/2022 to 5/1/2022
condition: ', cond, '; revenue: ', revenue)
autoplot(p) +
coord_cartesian(xlim = c(nrow(df3) - 90, nrow(df3) + 6)) +
ggtitle(plt.t) +
theme_minimal()
})
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
##
## [[9]]
lapply(predictions[-1:-3], function(x) x$mean) %>%
as.data.frame %>%
as.matrix %>%
diag %>%
{data.frame(Date = tail(df2$Date,1) + 1:6, Revenue = .)} %>%
arrange(desc(Revenue)) %>%
kbl('html') %>% kable_styling()
| Date | Revenue |
|---|---|
| 2022-04-26 | 280490.9 |
| 2022-04-28 | 280091.3 |
| 2022-04-30 | 279933.8 |
| 2022-05-01 | 279766.7 |
| 2022-04-29 | 279667.7 |
| 2022-04-27 | 279416.8 |
The total revenue predictions are all nearly the same because they are based solely on Cost, and the effect size of Cost in the model is small relative to the units of Revenue. However, a closer inspection reveals that, holding all else constant, spending on this channel on 4/26/2022 will lead to the most Revenue.
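For reference, the diag() step in the table code works because, once the six "max" forecast series are stacked as columns, the day-i forecast under the scenario that concentrates spend on day i sits at position [i, i]; a toy illustration with hypothetical numbers:

```r
# Three days x three scenarios (hypothetical values); scenario j concentrates
# spend on day j, so the diagonal picks out each day's "concentrated" forecast
fc <- matrix(1:9, nrow = 3)
diag(fc)  # 1 5 9
```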
minimum_cost = 100
while(T) {
revenue = forecast(mod, xreg = minimum_cost)$mean
if (revenue > 0) break else minimum_cost = minimum_cost + 10
}
data.frame(minimum_cost = minimum_cost, revenue = revenue)
## minimum_cost revenue
## 1 4740 1.837349
In addition, the model predicts negative revenue unless costs reach at least $4,740. As such, it would be inadvisable to spend any less than $4,740 in the channel on our most advantageous day (4/26/2022).
Viewing the four scatterplots of the bivariate relationship between
Cost and Revenue reveals that the relationship is approximately linear within
each year. Additionally, the slope of the regression line predicting
Revenue from Cost seems to vary by year, suggesting a possible
interaction. Further, to account for the historical data obtained
for each month, we can assume each month has a random intercept.
These aspects of the data can be modeled using multilevel models, with
random effects for various timing variables (e.g., year or month), in
addition to the effect of Cost.
mem1 = lmer(log.Revenue ~ (1|month) + log(Cost)*year, data = df2)
mem2 = lmer(log.Revenue ~ (1|year) + log(Cost), data = df2)
summary(mem1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log.Revenue ~ (1 | month) + log(Cost) * year
## Data: df2
##
## REML criterion at convergence: 131.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2849 -0.5698 0.0497 0.6316 4.0875
##
## Random effects:
## Groups Name Variance Std.Dev.
## month (Intercept) 0.01661 0.1289
## Residual 0.06219 0.2494
## Number of obs: 980, groups: month, 13
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 2.29297 0.33859 961.70169 6.772 2.21e-11 ***
## log(Cost) 0.92784 0.04180 971.99989 22.196 < 2e-16 ***
## year2020 2.17702 0.38600 971.99229 5.640 2.23e-08 ***
## year2021 -0.41698 0.39446 971.26037 -1.057 0.291
## year2022 2.53502 0.51455 971.98987 4.927 9.83e-07 ***
## log(Cost):year2020 -0.25612 0.04740 971.61705 -5.403 8.25e-08 ***
## log(Cost):year2021 0.03524 0.04754 971.92870 0.741 0.459
## log(Cost):year2022 -0.30298 0.06101 971.96645 -4.966 8.06e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) lg(Cs) yr2020 yr2021 yr2022 l(C):2020 l(C):2021
## log(Cost) -0.992
## year2020 -0.818 0.809
## year2021 -0.803 0.794 0.788
## year2022 -0.640 0.639 0.570 0.580
## lg(Cs):2020 0.826 -0.822 -0.997 -0.784 -0.573
## lg(Cs):2021 0.829 -0.826 -0.793 -0.996 -0.592 0.795
## lg(Cs):2022 0.672 -0.676 -0.578 -0.584 -0.995 0.585 0.602
summary(mem2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log.Revenue ~ (1 | year) + log(Cost)
## Data: df2
##
## REML criterion at convergence: 383.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4195 -0.6190 -0.0709 0.6385 3.5354
##
## Random effects:
## Groups Name Variance Std.Dev.
## year (Intercept) 0.004903 0.07002
## Residual 0.084863 0.29131
## Number of obs: 980, groups: year, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 3.15923 0.13081 282.33818 24.15 <2e-16 ***
## log(Cost) 0.82103 0.01505 935.79499 54.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## log(Cost) -0.960
While these models do show significant effects of year and month, there is not much evidence to treat these aspects as random effects, due to the low variability between months and years relative to their standard errors. This suggests that we should stick with the ARIMA model.
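To put rough numbers on that judgment, the intraclass correlations implied by the printed variance components can be computed directly (values copied from the two model summaries above):

```r
# Proportion of (group + residual) variance attributable to the grouping factor
icc_month <- 0.01661 / (0.01661 + 0.06219)      # mem1: month intercepts, ~0.21
icc_year  <- 0.004903 / (0.004903 + 0.084863)   # mem2: year intercepts, ~0.05
c(icc_month = icc_month, icc_year = icc_year)
```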
Based on the model forecasts for the week following April 25th, 2022, I would suggest that which days money is spent on matters less than how much is spent on this channel. To that end, I would suggest spending whatever is budgeted for this week. There is minor evidence that allocating expenditures to 4/26 and 4/28 is most advantageous, followed by 4/30, 5/1, 4/29, and 4/27; however, the apparent differences in expected revenue across days may be due to noise in the model and indicative of extrapolation error, so the latter suggestion should be treated with caution.
How much Revenue do you predict that we will make in that channel for the given week with the given budget?
Based on the first model, I predict the week will return $17,007.92 for the first day plus $262,223.33 for the remaining days, totaling $279,231.25.
One important limitation is that these models were not cross-validated, so the predictions they generate might reflect overfitting. Since we have extensive and dense historical data, this may not be a huge concern for our ultimate predictions about Revenue. However, cross-validation can be a useful tool for model comparison: it would let us assess whether the predictions from the multilevel models are inflated relative to the ARIMAX model.